Mining the Web for Bilingual Text

Author

  • Philip Resnik
Abstract

STRAND (Resnik) is a language-independent system for automatic discovery of text in parallel translation on the World Wide Web. This paper extends the preliminary STRAND results by adding automatic language identification, scaling up by orders of magnitude, and formally evaluating performance. The most recent end product is an automatically acquired parallel corpus comprising [...] English-French document pairs, approximately [...] million words per language.


Similar resources

Mining Bilingual Data from the Web with Adaptively Learnt Patterns

Mining bilingual data (including bilingual sentences and terms) from the Web can benefit many NLP applications, such as machine translation and cross-language information retrieval. In this paper, based on the observation that bilingual data in many web pages appear collectively following similar patterns, an adaptive pattern-based bilingual data mining method is proposed. Specifically, give...


Parallel Sentences Mining From The Web

Parallel sentences can benefit many NLP applications (e.g., machine translation, cross-language information retrieval). In this paper, candidate bilingual web pages are returned by submitting sentence pairs to a search engine and are then validated by surface patterns. We propose an algorithm for extracting candidate bilingual resources and filtering out useless bilingual web pages. The pair sentences includ...


Designing a System for Trend Analysis of Users in Website Surfing in Iran Using Data Mining and Text Mining Algorithms

Background and Aim: Because web surfing has become part of the lifestyle of a vast majority of people in society, and because more accurate social and cultural policy making is needed in this field, the authors set out to analyze how users view different websites so as to help politicians and practitioners. Methods: A design science research method is used in this research...


A Scalable Approach to Building a Parallel Corpus from the Web

Parallel text acquisition from the Web is an attractive way of augmenting statistical models (e.g., machine translation, crosslingual document retrieval) with domain-representative data. The basis for obtaining such data is a collection of pairs of bilingual Web sites or pages. In this work, we propose a crawling strategy that locates bilingual Web sites by constraining the visitation policy o...


Literature Review: Mining the Web for Parallel Text: The STRAND System

This paper presents a short review of mining the web for parallel texts with an emphasis on the STRAND system. In Section 2 we start by trying to broadly define what is meant by the word corpus. After that, in Section 3 we give an overview of the World Wide Web as a source for collecting corpora, followed (in Section 4) by a discussion on related copyright issues. We then review some articles t...


LiveTrans-Cross-Language Web Search through Live Mining of Query Translations

Enabling users to automatically find effective translations for query terms not included in a dictionary is one of the major goals of a practical cross-language Web search service. This paper presents a cross-language Web search system called LiveTrans, an experimental metasearch engine that provides English-Chinese cross-lingual retrieval of both Web pages and images. The system has bee...




Publication date: 1999